Prior to this lecture, you should have read chapters 2 and 3 of Regression and Other Stories.
All of these data sets have the same average and standard deviation for their x and y variables, and the same correlation between x and y.
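The classic example of this is Anscombe's quartet, which R ships as the built-in anscombe data frame (columns x1 through x4 and y1 through y4). Assuming that's the example shown here, a quick sketch verifying the claim:

```r
# Each of the four x variables has the same mean and standard deviation,
# and likewise for the four y variables.
sapply(anscombe[, c("x1", "x2", "x3", "x4")], mean)
sapply(anscombe[, c("x1", "x2", "x3", "x4")], sd)
sapply(anscombe[, c("y1", "y2", "y3", "y4")], mean)
sapply(anscombe[, c("y1", "y2", "y3", "y4")], sd)

# The correlation between x and y is also (nearly) identical in all four pairs.
mapply(cor,
       anscombe[, c("x1", "x2", "x3", "x4")],
       anscombe[, c("y1", "y2", "y3", "y4")])
```

Despite these identical summary statistics, scatter plots of the four pairs look completely different, which is exactly why plotting your data matters.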
A histogram is a common and useful way to visualize the distribution of a continuous variable. It is a special bar chart where the categories along the x-axis are bins (i.e. ranges) of values for a variable, and the height of the bars indicates the number of observations within each bin.
If you have a data set loaded into R as a data frame, you can quickly
create a histogram using the geom_histogram() function
within ggplot(). In this example, I have a data frame
called crashes and one of the variables in that data frame
is called crashes_per_km2. The bins argument
indicates the number of bins I want in my histogram.
ggplot(crashes) +
geom_histogram(aes(x = crashes_per_km2),
color = "black",
fill = "pink",
bins = 30) +
theme_minimal()
You can also create a histogram in Excel. You need to define a set of threshold values for your bins, then use the “Histogram” tool in the “Data Analysis” toolpack, which you’ll find on the “Data” ribbon if you’ve enabled it.
A jittered (or one-dimensional) scatter plot can be another good way
to visualize the distribution of a variable. Making your points very
small (size = 0.5) and transparent
(alpha = 0.5) can be helpful if you have a lot of points
plotted on top of one another (a problem called overplotting).
ggplot(crashes) +
geom_point(aes(x = crashes_per_km2, y = 0),
position = "jitter",
size = 0.5,
alpha = 0.5) +
scale_y_continuous(name = "",
breaks = c()) +
theme_minimal()
A more typical use of a scatter plot is to visualize the relationship between two continuous variables. In this example, I want to see the relationship between population density (people per square kilometer) and crash density (crashes per square kilometer). It is standard to have your outcome variable on the y-axis.
ggplot(crashes) +
geom_point(aes(x = ppl_per_km2,
y = crashes_per_km2),
size = 0.5,
alpha = 0.5) +
theme_minimal()
It’s straightforward to create a plot like this in Excel as well.
By default, Excel will guess what values you want on your x and y axes, and it sometimes guesses wrong. I recommend clearing out all the data series and adding in the one you want manually to be sure that you’re plotting what you think you’re plotting.
You can describe a continuous variable in terms of its central tendency and its spread.
The most common measure of central tendency is an average, which is also called a mean.
If you have a data set loaded into R as a data frame, you can
calculate the average of one of the variables using the
mean() function. In this example, I have a data frame
called crashes and one of the variables in that data frame
is called crashes_per_km2.
mean(crashes$crashes_per_km2)
## [1] 368.2335
In Excel, you can use the AVERAGE() function.
One problem with using an average value to describe your data is that a single observation with a very high or very low value (an outlier) can have a big influence on the average. A median is an alternative measure of central tendency: half of the observations in your dataset have values below the median, and half have values above it.
You can calculate a median with the median() function in
R or the MEDIAN() function in Excel.
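A small made-up example makes the difference concrete: a single extreme value drags the mean far from the rest of the data, while the median barely notices.

```r
values <- c(2, 3, 4, 5, 6)
mean(values)    # 4
median(values)  # 4

# Replace one value with an extreme outlier:
with_outlier <- c(2, 3, 4, 5, 100)
mean(with_outlier)    # 22.8 -- pulled far upward by the outlier
median(with_outlier)  # 4 -- unchanged
```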
Here are histograms for two different variables with the same average and very different medians.
A common measure of spread for a continuous variable is its standard deviation.
You can calculate a standard deviation using the sd()
function in R, or the STDEV.S() function in Excel. (Like sd(), STDEV.S() uses the sample formula, dividing by n − 1; Excel's STDEV.P() divides by n instead.)
A smaller standard deviation means the values are less spread out. Here are histograms for two variables that have the same mean, but different standard deviations.
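A tiny numeric example of the same idea (made-up values): two vectors with the same mean but very different spreads.

```r
narrow <- c(9, 10, 10, 11)
wide   <- c(1, 5, 15, 19)

mean(narrow)  # 10
mean(wide)    # 10

sd(narrow)    # about 0.82 -- values cluster tightly around the mean
sd(wide)      # about 8.41 -- values are much more spread out
```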
A standard deviation is related to the average difference between the value for an individual observation and the average value, and it is sensitive to outliers. A measure of spread that is less sensitive to outliers is the interquartile range: the range of values between the 25th percentile (i.e. the value that 25 percent of your observations fall below) and the 75th percentile (i.e. the value that 75 percent of your observations fall below).
In R, you can calculate percentile values using the
quantile() function. Here's how you would calculate the 25th, 50th,
and 75th percentiles (the 50th percentile is the median).
quantile(crashes$crashes_per_km2, probs = c(0.25, 0.5, 0.75))
## 25% 50% 75%
## 154.8500 324.3568 519.0817
In Excel, you can calculate percentile values using PERCENTILE.INC().
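R also has a convenience function, IQR(), that returns the width of the interquartile range (the 75th percentile minus the 25th) directly. A small made-up example:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

quantile(x, probs = c(0.25, 0.75))  # 25th percentile is 3, 75th is 7
IQR(x)                              # 4, i.e. 7 - 3
```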
Here are two distributions that are very different in terms of mean and standard deviation, but similar in terms of median and interquartile range.
One more measure of the spread of a variable that is very sensitive to outliers is the range of the data: the difference between the minimum and maximum values.
You can calculate the minimum value of a variable using the
min() function in R (or the MIN() function in Excel), and the maximum using max() (or MAX()).
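For example, using the summary values from the crashes example as a made-up vector:

```r
x <- c(34.93, 154.85, 324.36, 519.08, 1906.71)

min(x)           # 34.93
max(x)           # 1906.71
range(x)         # returns both at once: 34.93 1906.71
max(x) - min(x)  # the range as a single number: 1871.78
```

Note that R's range() returns the two endpoints rather than their difference, so you subtract them yourself to get the range as a single number.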
You can also quickly get a summary of a variable’s range,
interquartile range, mean, and median using the summary()
function in R.
summary(crashes$crashes_per_km2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34.93 154.85 324.36 368.23 519.08 1906.71
Both standard deviation and range are sensitive to outliers and extreme values, but they don’t measure the same property. Here are two distributions that have the same range, but different standard deviations.
In real life, it's common for variables to fall into histograms that are shaped kind of like a bell (a bell curve). This is so common that we call it a normal distribution. It looks something like this, with more values closer to the average value, and fewer values farther from the average. It's symmetrical in the sense that values are as likely to be below average as above average (so the median will be almost the same as the mean).
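You can simulate a normal distribution yourself with rnorm() and confirm these properties (the mean and standard deviation here are made up; hist() is used for a quick base-R look, but geom_histogram() works just as well):

```r
set.seed(1)
normal_values <- rnorm(10000, mean = 100, sd = 15)

# A quick base-R histogram; the bars form a symmetric bell shape
hist(normal_values, breaks = 30)

mean(normal_values)    # close to 100
median(normal_values)  # close to the mean, because the distribution is symmetric
```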
The methods we discuss in this class are based on an assumption that your continuous variables have an approximately normal distribution. Sometimes they don’t. This one doesn’t.
When you have a lot of lower values and fewer higher values like this, you can call this a right-skewed distribution (because the long tail of the histogram stretches out to the right). A right-skewed distribution might indicate a log-normal distribution, where the logarithms of the values approximate a normal distribution better than the values themselves do.
You might also have a bi-modal distribution like the one below, where the histogram shows two distinct humps. This might suggest you have two different populations (or categories) in your data, and you might need to account for that in your analysis somehow.
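One common way a bi-modal shape arises is when your data mix two groups with different typical values. A simulated sketch (made-up group means):

```r
set.seed(3)
# Two hidden groups: one centered at 30, one centered at 70
two_groups <- c(rnorm(5000, mean = 30, sd = 5),
                rnorm(5000, mean = 70, sd = 5))

# The combined histogram shows two distinct humps
hist(two_groups, breaks = 30)
```

If you can identify the grouping variable (here, which population each value came from), you can analyze the groups separately or include the group as a categorical variable.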
Variation in a categorical variable is somewhat simpler. You’ve more or less described it fully when you’ve reported the proportion of your sample in each category.
If I have a data frame in R called commuting, with a
variable called mode to indicate the typical mode of
transportation to work, I can get the number of people who report using
each mode like this:
commuting %>%
group_by(mode) %>%
summarise(number = n())
## # A tibble: 8 × 2
## mode number
## <chr> <int>
## 1 Bike 8390
## 2 Car 715885
## 3 Ferry 656
## 4 Motorcycle 3011
## 5 Other 8555
## 6 Taxi 1460
## 7 Transit 42950
## 8 Walk 23893
And I can show those counts as proportions by dividing by the total sample size (which is the length of any of the data frame's columns).
commuting %>%
group_by(mode) %>%
summarise(number = n()) %>%
mutate(proportion = number / length(commuting$mode))
## # A tibble: 8 × 3
## mode number proportion
## <chr> <int> <dbl>
## 1 Bike 8390 0.0104
## 2 Car 715885 0.890
## 3 Ferry 656 0.000815
## 4 Motorcycle 3011 0.00374
## 5 Other 8555 0.0106
## 6 Taxi 1460 0.00181
## 7 Transit 42950 0.0534
## 8 Walk 23893 0.0297
I can do something similar in Excel using a pivot table:
A bar chart is a reasonable way to visualize proportions. You can
create one in R using geom_bar().
commuting %>%
group_by(mode) %>%
summarise(number = n()) %>%
mutate(proportion = number / length(commuting$mode)) %>%
ggplot() +
geom_bar(aes(x = mode, y = proportion),
stat = "identity") +
scale_y_continuous(breaks = seq(0, 1, by = 0.1),
labels = paste0(seq(0, 100, by = 10), "%")) +
theme_minimal()
And you can do something similar in Excel.
When you have some very small categories, a tree map can be more legible than a bar chart.
library(treemapify)
commuting %>%
group_by(mode) %>%
summarise(number = n()) %>%
mutate(proportion = number / length(commuting$mode)) %>%
ggplot() +
geom_treemap(aes(area = proportion, fill = mode)) +
geom_treemap_text(aes(area = proportion, label = mode),
color = "white") +
scale_fill_brewer(type = "qual",
guide = "none") # leave off the legend
And you can do something similar in Excel.